Anaconda and Jupyter¶

In this notebook, we will work with the following:

  • Importing packages and namespaces.
  • Using alternative interfaces to Python.
  • Doing cool things with Jupyter.
  • Seeing some examples of visualization.
  • Considering some challenges of Jupyter.

Importing packages¶

By convention, imports go at the top of a Python script or notebook (see PEP 8). In relevant part:

Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

Imports should be grouped in the following order:

  • Standard library imports.
  • Related third party imports.
  • Local application/library specific imports.

You should put a blank line between each group of imports.

In [ ]:
# standard library
import sys
import time

# third party
import numpy as np
import pandas as pd
import plotly.express as px
from textblob import TextBlob
In [ ]:
pd.set_option("mode.copy_on_write", True)

Note a few things in the block above.

  1. The import sys is the simplest version.
  2. For packages we use a lot, we abbreviate the name with a conventional alias. For example, pandas is imported as pd (see pandas documentation).
  3. We can also import particular things from a package, like the class TextBlob from the package textblob.
  4. The lines that start with # are comments. Those lines are not executed by Python, and they are useful for us to make notes about what we are doing.
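
As a minimal sketch of the three import styles described above, the block below uses only standard library modules so it runs without any third-party packages. The modules and variable names are chosen for illustration and are not part of the notebook's own imports.

```python
# Plain import: names are reached through the module's namespace.
import math

# Aliased import: the same module, bound to a shorter name.
import statistics as st

# Targeted import: bring a single name into the current namespace.
from collections import Counter

circle_area = math.pi * 2**2           # reached through the "math" namespace
average = st.mean([1, 2, 3, 4])        # reached through the alias "st"
letter_counts = Counter("banana")      # used directly, with no prefix

print(circle_area)
print(average)
print(letter_counts.most_common(1))
```

The prefix (math. or st.) is the only difference between the first two styles. The third style drops the prefix entirely, which is convenient but makes it less obvious where a name came from.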

Let's see how these work in action.

In [ ]:
print(sys.executable)
/usr/local/bin/python

Note that to get to the executable attribute within the sys module, we have to go through the module's namespace: sys.executable. More about namespaces below.

Namespaces¶

Somewhat abstractly, the Python docs define a namespace as follows.

A namespace is a mapping from names to objects.

For our purposes, we can think of them as paths to get to tools of interest. This topic goes (much) deeper, but a more instrumental understanding is fine for our use.
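
To make the "mapping from names to objects" idea concrete, here is a small sketch using the standard library math module. It relies on the built-in vars(), which returns a module's namespace as a dictionary. The variable names are illustrative.

```python
import math

# vars() exposes the module's namespace as an ordinary dictionary.
module_namespace = vars(math)
print(type(module_namespace))

# Looking a name up in the mapping gives the same object as dotted access.
same_object = module_namespace["pi"] is math.pi
print(same_object)

# getattr() is the function form of the same lookup.
same_via_getattr = getattr(math, "pi") is math.pi
print(same_via_getattr)
```

In everyday use we rarely touch the dictionary directly. Dotted access such as math.pi is the usual way to follow the path.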

If we want to know what is contained in a namespace, we can easily find out with the dir() built-in function.

In [ ]:
dir(sys)
Out[ ]:
['__breakpointhook__',
 '__displayhook__',
 '__doc__',
 '__excepthook__',
 '__interactivehook__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__stderr__',
 '__stdin__',
 '__stdout__',
 '__unraisablehook__',
 '_base_executable',
 '_clear_type_cache',
 '_current_exceptions',
 '_current_frames',
 '_debugmallocstats',
 '_framework',
 '_getframe',
 '_getquickenedcount',
 '_git',
 '_home',
 '_stdlib_dir',
 '_xoptions',
 'abiflags',
 'addaudithook',
 'api_version',
 'argv',
 'audit',
 'base_exec_prefix',
 'base_prefix',
 'breakpointhook',
 'builtin_module_names',
 'byteorder',
 'call_tracing',
 'copyright',
 'displayhook',
 'dont_write_bytecode',
 'exc_info',
 'excepthook',
 'exception',
 'exec_prefix',
 'executable',
 'exit',
 'flags',
 'float_info',
 'float_repr_style',
 'get_asyncgen_hooks',
 'get_coroutine_origin_tracking_depth',
 'get_int_max_str_digits',
 'getallocatedblocks',
 'getdefaultencoding',
 'getdlopenflags',
 'getfilesystemencodeerrors',
 'getfilesystemencoding',
 'getprofile',
 'getrecursionlimit',
 'getrefcount',
 'getsizeof',
 'getswitchinterval',
 'gettrace',
 'hash_info',
 'hexversion',
 'implementation',
 'int_info',
 'intern',
 'is_finalizing',
 'last_traceback',
 'last_type',
 'last_value',
 'maxsize',
 'maxunicode',
 'meta_path',
 'modules',
 'orig_argv',
 'path',
 'path_hooks',
 'path_importer_cache',
 'platform',
 'platlibdir',
 'prefix',
 'ps1',
 'ps2',
 'ps3',
 'pycache_prefix',
 'set_asyncgen_hooks',
 'set_coroutine_origin_tracking_depth',
 'set_int_max_str_digits',
 'setdlopenflags',
 'setprofile',
 'setrecursionlimit',
 'setswitchinterval',
 'settrace',
 'stderr',
 'stdin',
 'stdlib_module_names',
 'stdout',
 'thread_info',
 'unraisablehook',
 'version',
 'version_info',
 'warnoptions']
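
The full listing is long, and many entries are internal names that start with an underscore. One way to narrow it down, sketched here with illustrative variable names, is to filter the result of dir() with a list comprehension.

```python
import sys

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(sys) if not name.startswith("_")]

# Narrow further to names that mention "path".
path_related = [name for name in public_names if "path" in name]

print(len(public_names))
print(path_related)
```

The exact contents depend on the Python version and platform, so the counts will vary from one environment to another.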

While we might think of namespaces as synonymous with packages, the concept is more general than that. Individual objects have their own namespaces, like the TextBlob class we imported earlier.

In [ ]:
dir(TextBlob)
Out[ ]:
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_cmpkey',
 '_compare',
 '_create_sentence_objects',
 '_strkey',
 'analyzer',
 'classify',
 'correct',
 'detect_language',
 'ends_with',
 'endswith',
 'find',
 'format',
 'index',
 'join',
 'json',
 'lower',
 'ngrams',
 'noun_phrases',
 'np_counts',
 'np_extractor',
 'parse',
 'parser',
 'polarity',
 'pos_tagger',
 'pos_tags',
 'raw_sentences',
 'replace',
 'rfind',
 'rindex',
 'sentences',
 'sentiment',
 'sentiment_assessments',
 'serialized',
 'split',
 'starts_with',
 'startswith',
 'strip',
 'subjectivity',
 'tags',
 'title',
 'to_json',
 'tokenize',
 'tokenizer',
 'tokens',
 'translate',
 'translator',
 'upper',
 'word_counts',
 'words']

We can also look at what is in the global namespace.

In [ ]:
dir()
Out[ ]:
['In',
 'Out',
 'TextBlob',
 '_',
 '_5',
 '_6',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__vsc_ipynb_file__',
 '_dh',
 '_i',
 '_i1',
 '_i2',
 '_i3',
 '_i4',
 '_i5',
 '_i6',
 '_i7',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'exit',
 'get_ipython',
 'np',
 'open',
 'pd',
 'px',
 'quit',
 'sys',
 'time']

Note that we see the names sys and pd (our alias for pandas) that we imported, and we also see the TextBlob class that we imported from its package.

import this¶

A fun import is The Zen of Python philosophy, which can be accessed by importing this. Note that I'm slightly breaking the rules above for the purposes of illustration.

In [ ]:
import this  # noqa: E402, F401
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Alternative interfaces for Python.¶

As we'll talk about later, Jupyter notebooks are a great interface for working with Python (or R or a number of other kernels). However, they are not the only game in town.

  1. Python interpreter. We can access the Python interpreter from the terminal.
  2. Running a script from the terminal. We can also make our own script and run it from the terminal.
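
For the second option, a script is just a text file of Python code. The sketch below shows a common layout, assuming a hypothetical file name hello_script.py. The function names are also illustrative.

```python
"""hello_script.py: a hypothetical script name used only for illustration."""
import sys


def build_greeting(name):
    """Return a greeting for the given name."""
    return f"Hello, {name}!"


def main(argv):
    # argv[0] is the script name; use the first argument if one was given.
    name = argv[1] if len(argv) > 1 else "world"
    print(build_greeting(name))


# This block runs only when the file is executed as a script,
# not when it is imported from another file or a notebook.
if __name__ == "__main__":
    main(sys.argv)
```

Saved under that name, it could be run from the terminal with python hello_script.py Ada. Because of the __name__ check, the same file can also be imported without printing anything.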

Jupyter¶

The Jupyter lab interface and notebooks provide a number of conveniences for research.

  • Jupyter Lab: text editor, terminal, window layouts.
  • Rich text: bold, italic, headings.
  • Bullets, lists, and code (non-executing).
  • Links, images, and equations.
  • Display of graphics.
  • Convenience items with cell magics.

Rich text¶

Jupyter uses the simple markdown syntax for formatting text. There are some extensions and differences from original markdown, so you may find the Jupyter Notebooks docs to be a better reference.

  • Bold a word by enclosing it in pairs of asterisks: **Bold**.
  • Italicize a word by enclosing it in single asterisks: *Italicise*.
  • *Do both* with three asterisks: ***Do both***.
  • We can also use headings by starting the line with one or more pound signs, where one is a top-level heading: #.

First heading¶

Second heading¶

Third heading¶

# First heading
## Second heading
### Third heading

Bullets and lists¶

  • Bullets can be made by beginning a line with a hyphen and a space: - Bullets.

Numbered lists start with a number, period, and a space:

  1. First
  2. Second
  3. Third

Note that in the markdown source below, every item starts with 1., and markdown handles the numbering for us. We could, of course, number them ourselves.

1. First
1. Second
1. Third

We can also nest lists and types by indenting:

  • Bullet
    1. Nested list item
    2. Another one
  • Another bullet
    1. More lists
      • More bullets
- Bullet
    1. Nested list item
    1. Another one
- Another bullet
    1. More lists
        - More bullets

Code¶

We can reference code in two ways. First, we can use inline code like import this by using backticks ` to enclose the code: `import this`. Second, we can make code blocks by using beginning and ending lines with three backticks: ```. Do note that I'm having to be tricky to display backticks inside of code.

def f_to_c(temp_f):
    return (temp_f - 32) * 5/9

We can make it a little nicer (with syntax highlighting) by adding the code type to the first line: ```python.

def f_to_c(temp_f):
    return (temp_f - 32) * 5/9
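
Code inside a markdown block is only displayed, never executed. To actually try the function, it would go in a code cell. A quick sketch, with the two test temperatures chosen as well-known reference points:

```python
def f_to_c(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


freezing_c = f_to_c(32)   # freezing point of water
boiling_c = f_to_c(212)   # boiling point of water

print(freezing_c)  # 0.0
print(boiling_c)   # 100.0
```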

Links and images.¶

We can add links, like one to my github page, using the text in brackets followed by the link in parentheses: [github page](https://github.com/jtkiley).

We can add images by using similar syntax to point to an image: ![alt text](../_img/pandas_logo.svg).

[Image: pandas logo]

Equations¶

Similar to code, we can also use math and equations inline and in blocks. For inline math, like the union of a set $S \cup T = \{x \mid x \in S \vee x \in T\}$, we can use a single dollar sign to denote math: $S \cup T = \{x \mid x \in S \vee x \in T\}$.
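
The same set union can be checked in Python. This is a small sketch with made-up sets S and T, comparing the built-in union operator against the definition in the formula.

```python
# Two small example sets, chosen only for illustration.
S = {1, 2, 3}
T = {3, 4}

# Built-in union operator.
union = S | T

# The definition from the formula: every x that is in S or in T.
candidates = range(1, 6)
by_definition = {x for x in candidates if x in S or x in T}

print(union)
print(union == by_definition)  # True
```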

We can also use blocks by using beginning and ending lines with two dollar signs: $$.

$$ \operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(Y_i-\hat{Y_i})^2 $$
$$
\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(Y_i-\hat{Y_i})^2
$$
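
To connect the formula to code, here is a minimal sketch that computes the mean squared error term by term in plain Python. The observed and predicted values are hypothetical.

```python
# Hypothetical observed values (Y) and predictions (Y hat).
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

n = len(y_true)

# One squared error per observation: (Y_i - Y_hat_i) ** 2
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]

# Average the squared errors: (1 / n) * sum(...)
mse = sum(squared_errors) / n

print(squared_errors)  # [1.0, 0.0, 4.0]
print(mse)
```

Here the squared errors sum to 5.0 over three observations, so the result is 5/3.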

There are many math features, including matrices:

$$ A = \begin{pmatrix} \underbrace{\begin{matrix} a_{0,0} \\ a_{1,0} \\ \vdots \\ a_{m-1,0} \end{matrix}}_{a_0} & \underbrace{\begin{matrix} a_{0,1} \\ a_{1,1} \\ \vdots \\ a_{m-1,1} \end{matrix}}_{a_1} & \begin{matrix} \dots \\ \dots \\ \ddots \\ \dots \end{matrix} & \underbrace{\begin{matrix} a_{0,n-1} \\ a_{1,n-1} \\ \vdots \\ a_{m-1,n-1} \end{matrix}}_{a_{n-1}} \\ \end{pmatrix} $$
$$ A = \begin{pmatrix}
\underbrace{\begin{matrix} a_{0,0} \\ a_{1,0} \\ \vdots \\ a_{m-1,0} \end{matrix}}_{a_0} &
\underbrace{\begin{matrix} a_{0,1} \\ a_{1,1} \\ \vdots \\ a_{m-1,1} \end{matrix}}_{a_1} &
\begin{matrix} \dots \\ \dots \\ \ddots \\ \dots \end{matrix} &
\underbrace{\begin{matrix} a_{0,n-1} \\ a_{1,n-1} \\ \vdots \\ a_{m-1,n-1} \end{matrix}}_{a_{n-1}} \\
\end{pmatrix}
$$

Visualization¶

We can also display graphics that are output from our work with data.

In [ ]:
# Create some random data
data1 = pd.DataFrame(np.random.rand(200, 4), columns=list("ABCD"))
In [ ]:
# Display the top of the dataframe
data1.head()
Out[ ]:
A B C D
0 0.241023 0.762872 0.763842 0.557562
1 0.787578 0.692489 0.639784 0.742105
2 0.731753 0.116221 0.260077 0.063716
3 0.981630 0.203194 0.927129 0.178575
4 0.104788 0.826646 0.336700 0.836136
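
Because the data are random, the values shown above will differ on every run. If repeatable numbers are wanted, one option is a seeded generator. This is a sketch that assumes NumPy's default_rng, with an arbitrary seed and illustrative variable names.

```python
import numpy as np
import pandas as pd

# Any fixed integer works as a seed; 42 is an arbitrary choice.
rng = np.random.default_rng(42)

# Same shape and column names as the data above, but reproducible.
data_seeded = pd.DataFrame(rng.random((200, 4)), columns=list("ABCD"))

# A second generator with the same seed produces identical values.
rng_again = np.random.default_rng(42)
data_again = pd.DataFrame(rng_again.random((200, 4)), columns=list("ABCD"))

print(data_seeded.shape)
print(data_seeded.equals(data_again))
```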
In [ ]:
# Make a histogram of column A
px.histogram(data1, x="A").show()
In [ ]:
# .show() returns None, so there is nothing useful to assign here
px.scatter_matrix(data1).show()
In [ ]:
px.scatter_3d(data1, x="A", y="B", z="C", color="D").show()

For many examples of really cool visualizations that are easy to do (and have code samples), see the plotly express documentation.

Cell Magics¶

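There are many magics that provide convenience features. Line magics start with one percent sign (%) and apply to a single line, while cell magics start with two (%%) and apply to the whole cell.

If you find yourself getting errors for a file not being found, it may help to know where the working directory is. You can use the %pwd line magic.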

In [ ]:
%pwd
Out[ ]:
'/workspaces/carma_python/notebooks'
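
Magics only work inside IPython and Jupyter. In a plain script, the standard library offers the same information. The sketch below uses pathlib, and the data file path is hypothetical.

```python
from pathlib import Path

# Equivalent of %pwd: the current working directory.
working_dir = Path.cwd()
print(working_dir)

# Hypothetical relative path, used only for illustration.
data_file = Path("data") / "example.csv"

# Show where that relative path actually points, and whether it exists.
resolved_file = data_file.resolve()
print(resolved_file)
print(data_file.exists())
```

Printing the resolved path is often the fastest way to see why a relative path is not being found.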

A really common issue with large text datasets is that some things take a long time to run. To know how long that is, we can use the %%time magic to get the time a cell takes to run. Do note how we're using two percent signs: %%. That makes the magic apply to the cell, instead of just the rest of the line.

In [ ]:
%%time

# Use time.sleep() to make this cell take some time.
time.sleep(2)
print("Done!")
Done!
CPU times: user 1.34 ms, sys: 260 µs, total: 1.6 ms
Wall time: 2 s
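
The output separates CPU time from wall time. While the cell sleeps, the processor does almost no work, so CPU time stays tiny even though two seconds pass on the clock. A rough sketch of the same measurement outside a notebook, using the standard library time module with a shorter sleep:

```python
import time

# perf_counter() tracks elapsed (wall) time; process_time() tracks CPU time.
wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.2)  # waiting uses the clock, not the processor

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"Wall time: {wall_elapsed:.3f} s")
print(f"CPU time:  {cpu_elapsed:.3f} s")
```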

Sharing notebooks¶

We have a few options to share our notebooks with others.

  1. If we want them to be able to run the code themselves, we should share the notebook file (.ipynb). Often, we will also need to send an environment file (environment.yml) and any data files that we rely on. Since data files may be large, you will often want to use a service like Dropbox to send them.
  2. If we only want to show the contents, we can export a version in html.
    1. In the menu at the top of the Jupyter page, click "File".
    2. Use your mouse to hover over "Export notebook as..."
    3. Click "Export Notebook to HTML".
    4. Share the HTML file. Note that some graphics seem to export correctly more often if you use "Restart Kernel and Run All Cells..." immediately before exporting.
  3. Also, if you add a Jupyter notebook to a GitHub repository, it can preview the contents, though your mileage may vary on visualizations.

Jupyter challenges¶

There are a few challenges when using Jupyter.

  1. The cell structure is flexible, but it will allow you to do things out of order, and anything you import or assign is still there, even if you change the code. To avoid this issue:
    1. Purposefully try to keep your code in the order that it should be run.
    2. Periodically, use "Restart Kernel and Run All Cells..." to make sure that your notebook runs in order.
  2. It is not ideal for version-controlled environments like GitHub. If you want to version control projects, you may find yourself moving code into .py files and using Jupyter for prototyping. Otherwise, this may not be a big concern.
  3. It is not a good format for a manuscript (unlike R notebooks). However, it is much more capable.
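
The first challenge can be hard to picture, so here is a rough simulation. It stands in for the kernel with a plain dictionary and runs "cells" with exec(). This is only an illustration of the stale-state problem, not how Jupyter is implemented, and the variable names are made up.

```python
# A dictionary standing in for the kernel's global namespace.
kernel_ns = {}

exec("threshold = 10", kernel_ns)          # cell 1, original version
exec("result = threshold * 2", kernel_ns)  # cell 2

# Cell 1 is edited to use a new name, and both cells are re-run
# without restarting. The old name is still in the namespace.
exec("cutoff = 5", kernel_ns)
exec("result = threshold * 2", kernel_ns)  # still works, using stale state
print(kernel_ns["result"])  # 20

# After a restart, the namespace is empty and the problem surfaces.
fresh_ns = {}
exec("cutoff = 5", fresh_ns)
try:
    exec("result = threshold * 2", fresh_ns)
    failed_after_restart = False
except NameError as err:
    failed_after_restart = True
    print(f"NameError: {err}")
```

This is why a periodic restart and full run is worth the time: it is the only point at which the notebook is tested the way someone else would run it.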

Overall, Jupyter notebooks are a great tool. Once you have some experience using them, you may find them fairly natural to work with.

Breakout Exercises¶

Let's do a few exercises to reinforce the concepts we learned above.

  1. import and namespace
  2. markdown
  3. Exporting HTML

EX1: import and namespace¶

We saw above how to import a package and inspect the namespace of it. Later in the course, we will be using the pynytimes package. Let's use it for an example here.

  1. Import the pynytimes package.
  2. Inspect the namespace. Which object do you think helps us create a connection to the article API?
In [ ]:
# 1-1 code
In [ ]:
# 1-2 code

EX2: markdown¶

Remember that we can use markdown to have rich text features. Let's try it.

  1. Make sure the cell below is a markdown cell.
  2. Enter the following sentence: "Getting free excerpts from the New York Times is cool, and the readme for the package can be found here."
  3. Make the word "free" italicized and the word "cool" bold.
  4. Make the phrase "found here" a link to https://github.com/michadenheijer/pynytimes.
In [ ]:
 

EX3: Exporting HTML¶

Many coauthors will be unfamiliar with using Jupyter notebooks, and it may not be a good time investment to have them set it up and learn how it works, only to review your work. However, if they can read it, a lot of the code will make sense. An easy way to share it is to export an HTML file that they can view in a web browser.

  1. Export this page as an HTML file.
  2. Using your computer's file browsing app, find the exported HTML file and double-click it to open it in your browser.